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Abstract 



Mean-field variational inference is widely used for approximate posterior inference in many 
probabilistic models. When the model is conditionally conjugate, variational updates are 
in closed-form. However, many models of interest are nonconjuate and practitioners may 
face the challenges of deriving the corresponding variational updates. In this paper, we 
develop and study two generic variational strategies for nonconjugate models — Laplace 
variational inference and delta method variational inference — which place minimal conditions 
on the model. These strategies extend and unify existing methods that were derived for 
specific models. We illustrate our approach on the correlated topic models, Bayesian logistic 
regression, and hierarchical Bayesian logistic regression. Our experimental results show that 
our methods work well on real-world datasets. 

Keywords: Variational inference, Nonconjugate models, Laplace approximations, The 
delta method 



1 Introduction 



Mean-field variational inference lets us efficiently approximate posterior distributions in 
complex probabilistic models (Jordan et al. 1999; Wainwright and Jordan, 2008). Applica- 



tions of variational inference are widespread. As examples, it has been applied to Bayesian 



mixtures (Attias, 2000; Corduneanu and Bishop, 2001), factorial models (Ghahramani and 



Jordan, 1997), and topic models (Blei et al. 2003). 



The basic idea behind mean-field inference is the following. First define a family of distribu- 
tions over the hidden variables where each variable is assumed independent and governed by 
its own parameter. Then fit those parameters so the resulting distribution is close to the 
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conditional distribution of the hidden variables given the observations, i.e., the posterior. 
Closeness is measured with Kullback-Leibler divergence. Inference becomes optimization. 

In many settings this approach can be used as a "black box" technique, specifically when we 
can easily compute the conditional distribution of each hidden variable given all of the other 
variables, both hidden and observed. (This class contains the models mentioned above.) For 
such models, which are called conditionally conjugate models, it is easy to derive a coordinate 



ascent algorithm that optimizes the parameters of the variational distribution (Beal, 2003 



Bishop, 2006). This is the principle behind software tools like VIBES (Bishop et al. 2003) 



and Infer.NET (Minka et al. 2010), which allow practitioners to define models of their data 



and immediately approximate the corresponding posterior with variational inference. 

Many models of interest, however, do not enjoy the properties required to take advantage 
of this easily derived algorithm. Such nonconjugate models include Bayesian logistic re- 



gression (Jaakkola and Jordan, 1996), Bayesian generalized linear models (Wells, 2001), 



discrete choice models (Braun and McAuliffe, 2007), Bayesian item response models (Clinton 



et al. 2004 Fox, 2010), and nonconjugate topic models (Blei and Lafferty, 2007). Using 



variational inference in these settings requires algorithms tailored to the specific model 
at hand. Researchers have developed a variety of strategies for a variety of models, in- 



cluding approximations (Braun and McAuliffe 2007 Ahmed and Xing, 2007), alternative 



bounds (Blei and Lafferty 



2007 



quadrature ( Honkela and Valpola , 2004 ) 



Jaakkola and Jordan 1996, Khan et al. , 2010 ), and numerical 



There have been some other recent efforts to examine variational inference in nonconjugate 



models. Paisley et al. (2012a) proposed a variational inference approach using stochastic 



search for nonconjugate models, approximating the intractable integrals with Monte Carlo 



methods. Gershman et al. (2012) proposed a nonparametric variational inference algorithm, 



which can be applied to nonconjugate models. Finally, the recent work of Knowles and 



Minka (2011) presents a message passing algorithm for nonconjugate models, which has 



been implemented in Infer.NET ( [Minka et al. 2010). This technique applies to a subset of 
models described in this paperQ 

In this paper we develop generic approaches to mean-field variational inference for a large 
class of nonconjugate models. We develop two related strategies, both based on Taylor 
approximations. We first develop Laplace variational inference. This approach embeds 
Laplace approximations — an approximation technique for univariate distributions (MacKay 



1992) — within a variational optimization algorithm. We then develop the delta method 



variational inference, which optimizes a Taylor approximation of the variational objective. 
The details of the algorithm depend on how the approximation is formed. Formed one way, 
it gives an alternative interpretation of Laplace variational inference. Formed another way, 



it is equivalent to using a multivariate delta approximation (Bickel and Doksum, 2007) of 
the variational objective^] 



1. It may be generalizable to the full set. However, how to find the required expectations would still have to 
be determined. 

2. The delta method was first used in variational inference by Braun and McAuliffe (20071 in the context of 



the discrete choice model. Our method generalizes their approach. 
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Figure 1: A general graphical model with hidden variables 9 (real- valued) and z, and 
observed variable x. (This graphical model We assume that p(z \ 9) is conjugate 
to p(x | z), but we do not assume that p(9) is conjugate to p(z \ 9). We omit fixed 
hyper-parameters for simplicity. It is straightforward to embed this model into 
more complex models. 



We studied our algorithms with two nonconjugate models: Bayesian logistic regression ( Jaakkola 



and Jordan 1996) (and its hierarchical extension) and correlated topic models (Blei and 



Lafferty, 2007). We found that our methods give better results than those obtained through 



special-purpose techniques. Further, we found that Laplace variational inference usually 
outperforms delta method variational inference, both in terms of computation time and 
fidelity of the approximate posterior. These methods significantly expand the class of models 
for which mean-field variational inference can be easily applied. 



2 Variational Inference in Nonconjugate models 

Consider a generic model with the following joint distribution, 

p(x, z,9) = p(x\z)p(z\9)p(9). (1) 

We assume this model has hidden variables 9 and z, and observed variable x. (The distinction 
between the two hidden variables will be made clear below.) This is illustrated as a graphical 
model in Figure [TJ 

The inference problem is to compute the posterior of 9 and z, p(z, 9\x). This is intractable for 
many models and we must resort to approximation. In variational inference, we approximate 
the posterior by positing a simple family of distributions over the latent variables q(9, z) 
and then finding the member of that family which minimizes the Kullback-Leibler (KL) 



divergence to the true posterior p{9, z\x) (Jordan et al. 1999). 



Mean-field variational inference is simplest and most widely used variational inference 
method. It uses a fully factorized variational family, 

q(z,9)=q(z)q(9). 

Under the standard variational theory, minimizing the KL-divergence between q(z, 9) and 
p(9,z\x) is equivalent to maximizing a lower bound of the log marginal likelihood of the 
observed data x. We obtain this bound with Jensen's inequality, 

logp(x) = log / p{x, z, 9)dzd9 > E q [logp{x, Z, 9)} + H[q) = C(q), (2) 
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where E g [•] is the expectation taken with respect to q and H[q] the entropy. Setting 
dC{q)/dq = shows that the optimal solution satisfies the following, 

q*{9) oc exp {& q{z) [log p(Z\9)p(9)}} , (3) 

q*{z) oc exp {& q{e) [\ogp{x\z)p{z\9)\] . (4) 



Here we have combined the optimal conditions from Bishop (2006) with the particular 
factorization of Equation [TJ 

A model is conditionally conjugate when p{9) is a conjugate prior to p(z\9) and p(z\9) is 
a conjugate prior to p(x\z). By conjugate, we mean that the conditional distribution of a 
variable given its Markov blanket is in the same family as the distribution of the variable in 
the model (Bernardo and Smith [1994 ). For example, the model is conditionally conjugate if 



p{9) is a Dirichlet and both p(z\9) and p(x\z) are multinomials. 

For such models, both q*{9) and q*(z) are available in closed-form (holding the other 
distribution fixed) and are in the same family as their corresponding distributions in the 
model, p{9) and p(z\9). This leads to the traditional coordinate ascent algorithm for mean- 
field variational inference, where we alternately update q{9) and q(z) using Equation [3] and 
Equation [I] flBishop[ |2006[ ) . 



However, this simple coordinate ascent algorithm is not available when the variable 9 is not 
part of a conjugate pair. In that setting, which arises in many practical models, we cannot 
compute closed-form updates for either distribution. Further, the nonconjugacy makes it 
difficult to evaluate the lower bound C{q) in Eq. [2] In the next sections, we will define a 
wider class of models beyond those that are conditionally conjugate, and we will develop 
generic variational inference for this more general class. 



2.1 A Class of Nonconjugate Models 

We present a wider class of models through a set of assumptions with respect to Eq. [TJ 

1. We assume that 9 is real- valued and the distribution p{9) is twice differentiable with 
respect to 9. If we require 9 > 9q (9q is a constant), we may define a distribution over 
log(0-0 o )- 

2. We assume the distribution p{z\9) is in the exponential family ( |Brown 1986), 



p(z\9) = h(z) exp { V (9) T t(z) - a(r](0))} , (5) 

where h(z) is a function of z; t{z) is the sufficient statistic; rj(9) is the natural parameter; 
and a(n(9)) is log partition function. We emphasize that p(9) is not necessarily the 
conjugate prior. 

3. The distribution p(x\z) is in the exponential family with z as the natural parameter, 

p(x\z) = h(x) exp |z T t(x) — a(z)} , (6) 
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and we require the distribution p(z\9) is conjugate (Bernardo and Smith, 1994). 
Consequently, the term t(z) in Eq. [5] is 

t(z) = [z,-a(z)]. (7) 

Note that t(x) and a(z) are overloaded. They are different from t(z) and a(rj(9)). 



These assumptions are weaker than those of conditional conjugacy. This family includes 



nonconjugate models like the correlated topic model (CTM) (Blei and Lafferty 2007), 



Bayesian logistic regression (Jaakkola and Jordan, 1996), discrete choice models (Braun 



and McAuliffe, 2007), Bayesian ideal point models ( |Clinton et al. 2004), and many others. 
Variational inference is not as straightforward in these models because p{9) is not conjugate 
to p{z | 9). Specifically, the update in Equation [3] does not necessarily have the form of an 
exponential family we can work with and it is difficult to use E g (m \p(z \ 9)] in Equation |4 



We will develop two variational inference algorithms for this class of models: Laplace 
variational inference and delta method variational inference. Both use coordinate ascent to 
optimize the variational parameters, iterating between updating q{9) and q(z). In Laplace 



variational inference, we use Laplace approximations (MacKay, 1992) within the coordinate 



ascent updates of Equation [3] and Equation [4| In delta method variational inference, we 
apply Taylor approximations to approximate the variational objective in Equation [2] and 
then derive the corresponding updates. Different ways of taking the Taylor approximation 
lead to different algorithms. Formed one way, this recovers the Laplace approximation. 



Formed another way, it is equivalent to using a multivariate delta approximation (Bickel 



and Doksum, 2007) of the variational objective function. 



In both algorithms, the variational distribution q(9) is a Gaussian and the variational 
distribution q(z) is in the same family as p(z\9) in Eq. [5j In Laplace variational inference, 
these forms emerge from the derivation. In delta method variational inference, they are 
assumed. 

As for traditional variational inference, our algorithms alternate between updating q{9) and 
q(z). In the following sections, we derive each algorithm for updating q{9). We then show 
how to update q(z). 



2.2 Laplace Variational Inference 

We first review the Laplace approximation. Then we show how to use it in variational 
inference. 



The Laplace Approximation. Laplace approximations use a Gaussian to approximate 



an intractable density (MacKay, 1992). Consider approximating an intractable posterior 
p{9\x). (There is no hidden variable z in this set up.) Assume the joint distribution 
p(x, 9) = p(x\9)p(9) is easy to compute. Laplace approximations use a Taylor approximation 
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around the maximum a posterior (MAP) to construct a Gaussian proxy for the posterior. 
They are typically used in univariate settings. 

First, notice the posterior is proportional to the exponentiated log joint 

p(9\x) = exp{logp(#|a;)} oc exp{logp(#, x)}. 

Let 9 be the MAP of p(9\x), found by maximizing \ogp(9,x). A Taylor expansion around 9 
gives 

log P (9\x) « logp(9\x) + \(9 - 9) T H(9)(9 - 9). (8) 
The term H(9) is the Hessian of \ogp(9\x) evaluated at 9, H(9) = V 2 \ogp(9\x)\„B- 

In the Taylor expansion of Eq. [ij the first-order term (9 — 9) T V logp(9\x)\ 9= Q equals zero. 
The reason is that 9 is the maximum of logp(9\x) and so its gradient V logp(9\x)\ Q= g is zero. 
Exponentiating Eq. [8] gives the approximate Gaussian posterior 

p(9\x) « i exp {-1(9 - 9) T (-H0)) (9-9)}, 

where C is a normalizing constant. In other words, p(9\x) can be approximated by 

p(9\x)^M(9,-H(9)- 1 ). (9) 

This is the Laplace approximation. While powerful, it is difficult to use in multivariate 
settings, for example, when there are discrete hidden variables. Now we describe how we 
use Laplace approximations as part of variational inference for a complex model. 



Laplace updates in variational inference. Laplace approximations have been used 
in approximate inference in more complex models. Smola et al. (2003) used them to 



approximate the difficult-to-compute moments in Expectation propagation (Minka, 2001) 



Rue et al. (2009) used them for latent Gaussian models. Here we want to use them for 



variational inference that can be applied to a wider range of nonconjugate models, including 
those with discrete hidden variables. 

We use Laplace approximations to update the variational distribution q(9). First, we combine 
the coordinate update in Eq. [3] with the exponential family assumption in Eq. [5j 



Now define 



q(9) oc exp {v(9) T E q(z) [t(z)] - a( V (9)) + logp(9)} 



t z ± E q{z) [t(Z)\ , 
f(9)^r } (9) T t z -a(r 1 (9)) + \ogp(9). 



(10) 

(11) 
(12) 



We approximate q(9) by taking a second-order Taylor approximation of f(9) around its 
maximum, following the same logic as from Equation^) Let 9 be the value that maximizes 
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f(9) and V 2 /(#) be the Hessian matrix evaluated at 9. Adapting Eq. [9] and Eq. 
setting gives 



10 



q(9) « M [9, -V 2 /(6>) 



-1 



to this 



(13) 



The Gaussian form of q(9) stems from using the Taylor approximation. This is an approximate 
update for q(9) and can be embedded in a coordinate ascent algorithm for a nonconjugate 
model. Notice we need to use numerical optimization to obtain 9. 



2.3 Delta Method Variational Inference 



In Laplace variational inference, the variational distribution q{9) Eq. 13 is solely a function 



of 9, the maximum of f(9) in Eq. 12 A natural question is, would other values of 9 be 
suitable as well? To consider such alternatives, we describe a different way of performing 
variational inference. We approximate the variational objective C in Equation [2] and then 
optimize that approximation. 

Again we focus on q(9) and postpone the discussion of q{z). The variational distribution 
q{9) is a Gaussian q{9) = M(jx, S). We isolate the terms related to q{9) in the objective, 
then substitute the exponential family assumption about p(z\9) in Equation [5] into C(q) in 
Equation [2j 

C(q{9)) =E q{9) [ V (9) T M q{z) [t(z)\ - a(r,(9)) + logp(0)] + H[q(9)}. 
The entropy of the Gaussian is H[q{9)} = \ log |E| + C, where C is a constant. Notice the 



first three terms are the same function f(9) defined in Eq. 12 We simplify the lower bound 
C(q(9)), 

£(q(9))=E q{e) [/(0)] + ±log|£|. 

We cannot easily compute the expectation in the first term. We use a Taylor expansion of 
f(9) around a chosen value 9, 

f(9) « f(9) + v/(0)(0 -9) + 1(9- 9) T ^f(9)(9 - 9). 

With this Taylor approximation, C(q{9)) can be approximated with 

C(q{9)) « f(9) + V j^) T (p - 9) + |( M - 9) T ^f(9)(f, - 9) 



+ |(Tr 



{v 2 /Ws}+log|S|), (14) 



where Tr(-) is the Trace operator. This is the function we optimize with respect to the 
variational parameters of q(9), {fj,, £}. 

The form of this optimization, however, depends on how we choose 9, the point around 
which to approximate f(9). Unlike in Laplace variational inference, 9 can be any value. We 
will describe three options. 
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One choice is to set 
in Equation 



14 



gives jj, 



to be the maximum of f(0). Then maximizing the approximation 
8 and £ = — V 2 /(0) _1 . Notice this is the update derived in 



Section 2.2 It gives a different derivation of Laplace variational inference. 



A second choice is to set 8 = fj,, i.e., the mean of the variational distribution q{6). With this 
choice, the variable around which we center the Taylor approximation becomes part of the 
optimization problem. The objective in Eq. |14|is 



C(q(0)) « /(/i) + |Tr { V 2 /(/i)£} + |log|S|. 



(15) 



This is the multivariate delta method for evaluating ^ q m) [f(8)j (Bickel and Doksum, 2007) 
and so we call this choice delta method variational inference. 

In delta method variational inference, we optimize C(q(8)) with coordinate ascent on ji and 
S. We use gradient methods to find //. (Note this is more expensive than Laplace variational 
inference because it requires the third derivative V 3 /(^).) Then, given a value of /j,, we 
optimize £ in closed form £ = — V 2 /(//) _1 . Braun and McAuliffe (2007) were the first to 



use the delta method in a variational inference algorithm, developing this technique for the 
discrete choice model. If we assume p(8) is Gaussian then we recover their algorithm. With 
the ideas presented here, we can now use this strategy in many models. 

A final choice is to guess 8, for example, as the mean of the variational distribution from 
the previous iteration of coordinate ascent. If p{8) is a Gaussian distribution, this recovers 
the updates derived in Ahmed and Xing (2007) for the correlated topic model j^] We found 
this simple guess did not work very well on our experiments. (It did not always converge, 
probably due to the difficulty of choosing an appropriate initial 8.) We thus focus on Laplace 
and delta method variational inference. 



2.4 Updating q(z) 



We have derived variational updates for q{8) using two methods. We now turn to the update 



for q(z). We will show that Laplace (Section 2.2) and delta method variational inference 
(Section |2.3[) lead to the same update form for q(z). Further, we have implicitly assumed 



that ^q(z) [t( z )] i n Eq. 11 and Eq. 12 is easy to compute. We will confirm this as well. 



Laplace variational inference. First, we apply the exponential family form in Equation[5] 
to the exact update of Eq. |4j 

logg(z) = C + logp(x\z) + log h(z) + E q(e) [ V {8)} T t(z), 

where C is a constant not depending on z. Now we use p(x\z) from Eq. [6] and t(z) from 
Eq. [7] to obtain 

q(z) oc h(z) exp { (E q(e) [ V (8)} + [t(x), 1]) T t(z)] , (16) 
3. This is an alternative derivation of their algorithm. They derived these updates from the perspective of 



generalized mean-field theory (Xing et al. 20031 
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Initialize variational distribution q(9) and q{z) 
repeat 

Compute the statistics E q ( z ) \t(z)\ in Eq. 
To obtain g(0), 
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Option a): in Laplace variational inference, compute Eq. 13 



Option b): in delta method variational inference, optimize Eq. 15 
Approximate the statistics E g (m [r}{9)\ in Eq 



Update q(z) in Eq. 16 



17 



until Convergence (See Section 2.5 for the criteria), 
return q(9) and q(z). 



Figure 2: The approximation framework in variational inference. 



which is in the same family as p(z\9) in Eq. [5j This is the update for q(z). Further, because 
it is in a simple exponential family form, we can compute ^ q ( z ) [t(z)] from Eq. 
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We have additionally assumed the tractability of IE^g) [r](9)]. To approximate this, we take 
a Taylor approximation of r/(8) around the variational parameter /i, 

n(e) « n{p) + v v ( M ) T (6 - M ) + \{e - m) t v 2 77(^)(0 - M ). 

Since q(6) = M (ji, S), this means that 

foW] « »/0*) + 5 Tr { V 2 ^)^} • (17) 
(Note that the linear term E q ^ [vry(/i) T (6' - fi)] = 0.) 



Delta method variational inference. Using delta method variational inference to 
update q(6), the update for q(z) is identical to that in Laplace variational inference. We 
isolate the relevant terms in Eq. [2j 



C(q(z)) =E q(z) log P (x\z)+logh(z)+E q{6) [rj(e)} T t(z) + H[q(z)}. 



(18) 



Setting the partial gradient dC(q(z))/dq(z) = gives the same optimal q(z) of Eq. |4j Com- 
puting this update reduces to the approach for Laplace variational inference in Equation 16 



2.5 The Algorithm 

We have described updates for q(8) and q(z), which can be embedded in an iterative algorithm. 
See Figure [2] for an outline of this algorithm. We have reduced deriving variational updates 
for complicated nonconjugate models to somewhat mechanical work — calculating derivatives 
and calling a numerical optimization library]^] Regarding the variants of our algorithm, 

4. Python implementations of our algorithm for the example models described in Section [3] are available at 
http : //www. cs . emu . edu/ ~ chongw/ sof tware/noncon jugate- inference . tar . gz 
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Laplace variational inference is simpler to derive because it only requires first derivatives 
of the function in Eq. |12[ while delta variational inference requires third derivatives. We 
empirically study the differences between these methods in Section [4j 

We monitor the convergence of our algorithm in Figure [2] using the following criteria, 

• The change of the mean of \E q [9] | is less than 0.00001. 

• The relative change of an approximate variational objective (explained below) is less 
than 0.00001. 



Our approximate variational objective is, 

C(q(z, 9)) + V/(#) T (/, - 9) + §(// - 9) T V 2 f(9)(v - 9) 

+ \ (Tr {V 2 /(^)S} + log |£|) + H(q(z)), (19) 

where f(9) is defined in Eq. [l2j q(9) = A/"(/x, X) and H(q(z)) is the entropy of the variational 
distribution q(z). We obtain this approximati on b y combining the analysis for q{9) in Eq. 14 
and for q(z) in Eq. 



18 



As outlined in Section 



2.3 



the choice of 9 determines whether we use 



Laplace or delta method variational inference. In both cases, the approximate lower bound 
is computed the same way. Finally we limit the maximum number of iterations to 100. 

We note that our algorithms are not associated with a strict lower bound of the marginal 



probability of the observations. However, as also observed in Braun and McAuliffe (2007) 



we found that they converge in practice by monitoring Eq. 19 Figure [3] shows the plots of 
this quatity for one document during the training for the correlated topic model (CTM). 



(For CTM, the algorithm in Figure [2] applies to a single document. See Section 3.1 for 
more details.) Further, Eq. 19 is not the function we are optimizing — in fact the true 



global objective function is not clear. Even the simpler Laplace approximation is not clearly 
minimizing a well-defined distance function between the approximate Gaussian and true 
posterior (MacKay 1992[ ). Thus, while this approach is an approximate coordinate ascent 



algorithm, clearly characterizing the corresponding objective function is an open problem. 



In the following extreme case, Eq. 19 becomes the true global objective; if we assume the 
variational distribution q{9) = 5(9 — [i), i.e., a delta distribution^] both Laplace and delta 
method variational inference reduce to a maximum a posterior (MAP) estimate of 9. 



3 Example Models 



We have described a generic algorithm for approximate posterior inference in nonconjugate 
models. We now discuss several nonconjugate models from the research literature for which 
we can apply our method. For each model, we identify the hidden variables z and 9 from 
Eq. [5j the observation x from Eq. [fjj and the form of f(9) from Eq. 12 (We give its 



derivatives in the appendix.) In the next section, we study how our algorithm perform when 
analyzing data under these models. 



5. In this case, the entropy of q(6) is 0. 
distributions in variational inference. 



See Welling et al. ( 2008 ) for more discussions on using delta 
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Figure 3: The approximate variational objective from Eq. 19 as a function of the iteration. 
This is for the correlated topic model on a single document (see Section 3.1). 



3.1 Correlated Topic Models 



The correlated topic model (C TM) (iBlei and Lafferty 2007) is an extension of latent Dirichlet 
allocation (LDA) (Blei et al. 2003). A topic model is a probabilistic model of a document 
collection. Each document is treated as a collection of observed words that are drawn 
from a mixture model. The mixture components, called "topics," are distributions over 
terms that are shared for the whole collection; however, each document exhibits them with 
individualized proportions. 

Conditioned on a corpus of documents, the posterior topics place high probability on words 
that are associated under a single theme (e.g., sports or health or business); the posterior topic 
proportions reflect how each document exhibits those themes. This posterior decomposition 
of a collection can be used for summarization, visualization, or forming predictions about a 
document. See Blei (2012) for a review of topic modeling. 



The per-document topic proportions are a latent variable. In LDA these are given a Dirichlet 
prior, which makes the model conditionally conjugate. The CTM extends LDA by replacing 



the Dirichlet prior on the topic proportions with a logistic normal prior (Aitchison 1982). 
This is a richer prior that can capture correlations between occurrences of the components. 
(For example, a document about sports is more likely to also be about health.) While the 
CTM is more expressive, it is no longer conditionally conjugate. 

We now cast the CTM as a type of generic model from Section [2j Suppose there are K 
topic parameters Px-.K (fixed for now), each of which is a distribution over V terms. Let 
7r(9) denote the multinomial logistic function, which maps a real-valued vector to a point 
of the simplex with the same dimension, tt(9) oc exp{#}. The CTM assumes the following 
generative process of a document: 



1. Draw log topic proportions 9 ~ AA(/irj, £ 



o ■ 
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2. For each word n: 

(a) Draw topic assignment z n \ 6 ~ Mult(7r(0)). 

(b) Draw word x n \ z n , (3 ~ Mult(/3 Z? J. 



The topic proportions tt(8) are drawn from a logistic normal distribution; their correlation 
structure is captured in So- The topic assignment variable z n indicates which topic the 
nth word is drawn from. In terms of our earlier notation, is identical; the earlier z is the 
collection of topic assignments ZiW, the observation x is the collection of words xi-.n. This 
model satisfies the conditions in Section [2.11 



The variational distribution for parameter 8 is q(6) = A/"(/u, X); and the variational distri- 



bution for the z variables is discrete. In delta method variational inference, as in Braun 



and McAuliffe (2007), we restrict the covariance matrix X to be diagonal. This is because 



a full covariance matrix makes the derivative of Eq. 15 (with respect to u) expensive to 
compute. Laplace variational inference does not require this simplification. The detailed 
derivations of the algorithm are in Appendix A. Note that, besides CTM, this approach can 
be used in a variety of nonconjugate topic models, including Dirichlet-multinomial regression 



et al. 2012b) 



(DMR) (Mimno and McCallum, 2008), Dynamic topic models (DTM) (Blei and Lafferty 



2006. Wang et al. 2008) and the discrete infinite logistic normal distribution (DILN) (Paisley 



Note that for CTM, this algorithm is only for inference at the document level. As in Blei 



and Lafferty| ( |2007D , we estimate the corpus- level topic parameters (3\-k and logistic normal 
parameters (//q> Xq) with variational EM. 



3.2 Bayesian Logistic regression 



Bayesian logistic regression is a well-studied model for binary classification ( Jaakkola and 



Jordan, 1996). Let t n is be a p-dimensional observed feature vector for the nth sample and 



z n be its class (an indicator vector of length two) . Let 6 be the real- valued parameter vector 
in W; there is a component for each feature. The usual Bayesian logistic regression is the 
following: 



p(#)=A%o,X ), 

P {z n | e,t n ) = a(e T t n y^a(-e T t n y^, vn 

where a(y) = 1/ (1 + exp(— y)) is the logistic function. 



(20) 



This is a subset of the model class in Section [2j In our earlier notation, 9 is identical and z are 
the observed classes of each data point. There is no additional no additional observed variable 
x downstream. The variational distribution need only be defined for 6, q(9) = N(/j>, X). Using 
Laplace variational inference, our approach recovers the standard Laplace approximation 



for Bayesian logistic regression (Bishop, 2006). (This gives a connection between standard 



Laplace variational inference and variational inference.) Delta method variational inference 
provides an alternative. The detailed derivations are in Appendix B. 
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An important extension of Bayesian logistic regression (and Bayesian generalized models) is 
hierarchical Bayesian logistic regression. The hierarchical variant considers multiple related 



logistic regression problems and solves them simultaneously (Gelman and Hill, 2007). We 



will take an empirical Bayes approach. We treat the parameters for each problem from a 
shared prior, and then estimate the hyperparameters of that prior with maximum likelihood 
or MAP. With M related problems, we construct the following hierarchical model: 



1. Draw the global hyperparameters: 



v-l 



Wishart(i>, $ c 



Mo-A^Eo) 

2. For each problem m: 

(a) Draw coefficients 6 m ~ Af(fM), So) 

(b) Draw class labels Zi-.n, conditioned on covaxicitGS ti:iv" 



We construct f(9 

Sq using regularized variational EM (Bishop 



in Eq. 12 separately for each m, and fit the hyperparameters /io and 

This amounts to MAP estimation, 



2006). 



with priors as specified above. We note that logistic regression is a generalized linear model 
with a binary response and canonical link function (McCullagh and Nelder, 1989). It is 



straightforward to use our algorithms with other Bayesian generalized linear models (and 
their hierarchical variants). 



4 Empirical Study 

We studied the Laplace and delta method variational inference for the CTM and Bayesian 
logistic regression model on several data sets. 



Correlated topic models (CTM). For the CTM, we analyzed two document collections. 
The Associated Press data contains 2,246 documents from the Associated Press, with 436K 
observed words and a vocabulary size of 10,473 terms. The New York Times data contains 
9,238 documents from the New York Times, with 2.3 million observed words and a vocabulary 
size of 10,760 terms. 



We measured per-word held-out log likelihood to evaluate the models' fit to the data. For 
each collection, we held out 20% of the documents as a test set. We split each document 
in Vtest into halves, Wi = (wdi,Wd2), and computed the log likelihood of the words in Wd2 



(independently) conditioned on Wd\ and Ptram, similar to Asuncion et al. (2009) and Blei 



and Lafferty (2007). A better predictive distribution gives higher likelihood to the second 
half - 
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Figure 4: The approximate lower bound (y axis) during training for CTM when the number 
of topics K = 60 (Others are similar); this approximate lower bound is the sum of 



the per-document approximate lower bound for each document in Eq. 19 The 
iteration (x axis) indicates the overall number of passes of the entire corpus. It 
converged in practice. 



Let nd be the variational expectation of the topic proportions given w d \. The predictive 
probability of w d2 given w dl is approximated by p{w d2 \w dl , T> tTaia ) « H weWd2 J^k^dkPkw 
Then the per-word held-out log likelihood is 

likelihoodp W = Edsx> tes t log p(w d2 \w dl , Arain)/Edec tcst 

This evaluation lets us compare methods regardless of whether they provide a bound. Here 
\w d 2\ is the number of tokens in w d 2- It is a measure of the quality of the estimated predictive 
distribution. Exact computation is intractable, and so we use the following approximation. 
Let Tt d be the variational expectation of the topic proportions given w d2 - The predictive 
probability of w d 2 given w d \ is approximated by 

p{w d 2\w dl ,V tT& i n ) Rj U weWd2 Y^k^dkPkw 



We compared the original method of Blei and Lafferty ( 2007 ) to Laplace and delta method 
variational inference. We varied the number of topics from 20 to 80. We stopped fitting 
when the relative change of the approximate lower bound in the corpus-level EM algorithm 
was smaller than 10e-5. In CTM, this approximate lower bound amounts to be the sum of 
the per-document approximate lower bound for each document in Eq. 19 Figure [4] shows 
some example approximate lower bounds during training. 

Figure [5^a) shows per-word held-out log likelihood as a function of the number of topics. 
On both data sets, the original algorithm of Blei and Lafferty (2007), tailored for this 



model, usually performed worse than both generic algorithms. Our conjecture is that Blei 



and Lafferty (2007) gives a strict lower bound, which might be loose; our method uses an 



approximation which, while not a bound, might be closer to the objective and give better 
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Figure 5: Comparison on per-word held-out log likelihood. Higher numbers are better. 

Laplace variational inference is "Lap-Var"; delta method variational inference 
is "Delta- Var"; the method by Blei and Lafferty in Blei and Lafferty (2007) is 
BL. (a) Held-out likelihood against number of topics. Delta- Var performs better 
BL. Lap-Var performs best, (b) Held-out likelihood against running time when 
K = 60. Lap-Var is faster and gives performs better than Delta- Var. 



predictive distributions. Laplace variational inference was always better than both other 
algorithms. 

We also compared the running times of Laplace variational inference and delta method 
inference. For a model with 60 topics, Figure [5^b) shows the per-word held-out log likelihood 
as a function of timej^] (Other topic settings looked similar.) In addition to finding a better 
fit to the data, Laplace variational inference is faster. 



Bayesian logistic regression We studied our algorithms on Bayesian logistic regression 
in both standard and hierarchical settings. In the standard setting, we analyzed two datasets. 



The Yeast data (Elisseeff and Weston, 2001) is formed by micro-array expression data and 
phylogenetic profiles. There are 1,500 genes in the training set and 917 genes in the test set. 
The input dimension is 103. One gene is associated with up to 14 different edges (14 labels), 



corresponding to 14 independent binary classification problems. The Scene data (Boutell 



et al. 2004) contains 1,211 training and 1,196 test images, with 294 images features and up 
to 6 scene labels per image, corresponding to 6 independent binary classification problems^] 

We used two metrics. First we measured accuracy, which is the proportion of test-case 
examples correctly labeled. Second, we measured average log predictive likelihood. Given a 
test-case input t with label z, we compute the log of the predictive likelihood, 

log p(z \n,t) = zi log aif^t) + z 2 log cr(-/i T t) , 



We did not formally compare the running time of |Blei and Laffert y | |2007| l's method because we used the 
authors' C implementation, while ours is in Python. We observed, however, that their method took more 
than five times longer than ours. 



7. The Yeast and Scene data are at http://mulan.sourceforge.net/datasets.html 
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where \x is the mean of variational distribution q(9) = M(fi, £). Higher likelihoods indicate a 
better fit. For both accuracy and predictive likelihood, we used cross validation to estimate 
the generalization performance of each inference algorithm. We set the priors = and 
So = I. 

We compared generic Laplace variational i nference (Section |2.2[), delta method variational 



inference (Section 2.3), and the method of Jaakkola and Jordan (1996). This last method 



preserves a lower bound on the marginal likelihood with a first-order Taylor approximation, 
and was developed specifically for Bayesian logistic regression. Table [T] gives the results. 
Laplace variational inference and delta method variational inference gave slightly better 
accuracy than Jaakkola and Jordan's method, and much better log predictive likelihood^] 

To study hierarchical Bayesian logistic regression, we analyzed the School data from the 
Inner London Education Authority]^] This data contains examination records from 139 
secondary schools for the years 1985, 1986 and 1987. It is a random 50% sample with 
15,362 students. The students' features contain four student-dependent features (year of 
the exam, gender, VR band (individual prior attainment data), and ethnic group) and four 
school-dependent features (percentage of students eligible for free school meals, percentage 
of students in VR band 1, school gender and school denomination). We coded the binary 
indicator of whether each was below the median ("bad") or above ("good"). We use the 



same 10 random splits of the data as Argyriou et al. (2008) 



In this data, we can either treat each school as a separate classification problem, pool all 
the schools together (as a single big school), or analyze them with the hierarchical logistic 
regression structure described in Section [3. 2| For the non- hierarchical settings, we studied 
Jaakkola and Jordan's method, Laplace variational inference, and delta method variational 
inference. When treating the problem hierarchically, we studied the corresponding Laplace 
and delta method variational inference algorithms. Table [2] gives the results. Hierarchical 
logistic regression performed best in terms of both accuracy and predictive log likelihood. 



5 Conclusions 



We have developed Laplace and delta method variational inference, two strategies for 
variational inference in a large class of nonconjugate models. These methods approximate 
the variational objective function with a Taylor approximation, each in a different way. We 
studied them in two nonconjugate models and showed that they work well in practice, forming 
approximate posteriors that lead to good predictions. In the examples we analyzed, our 
methods worked better than methods tailored for the specific models at hand. Further, we 
showed that Laplace variational inference is better and faster than delta method variational 
inference. 



Previous literature, e.g., (Xue et al 



2007 



Archambeau et al. 



2011 1 also treat Yeast and Scene as 



multi-task problems. This is slightly non-standard, since each sub-problem shares the same input features. 
In our experiments, we found that standard Bayesian logistic regression performed competitively with 
the multi-task algorithms reported in |Archambeau et al.| (|2011|). 
The data is available at http:/ /multilevel. ioe.ac.uk/intro/datasets. html 
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Table 1: Comparison of different methods on standard Bayesian logistic regression using accuracy 
(Acc.) and averaged log predictive likelihood (Lik.) — the higher, the better. Results 
are averaged from five random starts. Variance is too small to show. The best results 
are highlighted. Lap-Var and Delta- Var gives slightly better accuracy but much better 
predictive log likelihood. 





Yeast 


Scene 


Algorithm 


Acc. Lik. 


Acc. Lik. 


Jaakkola 
Lap-Var 
Delta- Var 


79.7% -0.678 
80.1% -0.449 
80.2% -0.450 


87.4% -0.670 
89.4% -0.259 
89.5% -0.265 



Table 2: Comparison of different methods on School data using accuracy (Acc.) and averaged log 
predictive likelihood (Lik.). Results are averaged from 10 random splits. Variance is too 
small to show. The best results are highlighted. Lap-Var-all indicates the case where we 
treats all schools together. The hierarchical approach performs the best. 





School 


Algorithm 


Acc. 


Lik. 


Jaakkola 


70.5% 


-0.684 


Lap-Var 


70.8% 


-0.569 


Delta- Var 


70.8% 


-0.571 


Jaakkola-all 


71.2% 


-0.685 


Lap-Var-all 


71.3% 


-0.557 


Delta- Var-all 


71.3% 


-0.557 


Hier-Lap-Var 


71.9% 


-0.549 


Hier-Delta-Var 


71.9% 


-0.559 



This method expands the scope of variational inference. One area of future work is to further 
expand that scope, by relaxing the assumption of full conditional conjugacy in z. In such a 
setting, we may be able to use moment matching (Tierney et al. 1989) to develop efficient 
variational algorithms. 



Appendix A: The correlated topic model 



In correlated topic models (CTM) (Blei and Lafferty, 2007), there are K topic parameters 
(3\-K (fixed for now), each of which is a distribution over V terms. Let n be the topic 
proportions for a document and n be the index of an observed word x n . The CTM assumes 
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the following generative process of a document, 

#~AA(/io,X ), 
7r oc exp(#) 

z n 1 7r ~ Mult(7r), 

x n |z„,/3~Mult(/3 z J. 

In this model, the topic proportions tt are drawn from a logistic normal distribution. Their 
correlation structure is captured in So- The variable z n indicates which topic the nth word 
is drawn from. We can now identify the quantities from Eq. 5, Eq. 6 and Eq. 7 that we 
need to compute f{9) in Eq. 12 

h(z) = 1, a(z) = 0, t(z) = J2 n z ni 

r,(9) = 6 -log {J2 k exp{0 fc }} , a( V (9)) = 0. 

With this notation, function f(9) defined in Eq. 12 is, 

f(0) = v (e) T i z -l(e- u ) T s \e - Mo ), 

where t z is the expected word counts of each topic under the variational distribution q(z). 

Using diTi/d9j = ^(lr^-i — Wj), we obtain the gradient and Hessian for function f(9) in 
CTM, 

where 1 [i=j] = 1 if i = j and otherwise. Note the first V/(0) is enough for Laplace 
variational inference. 

In delta method variational inference, we also need to compute the derivative of 
Trace {v 2 /(fl)S} = (- £f =1 vr fc S fcfc + tt t Z<k) £* =1 [t z ] k - Trace(£ 



We assume £ is diagonal for the delta method for simplicity (Braun and McAuliffe, 2007). 
(Note for Laplace method, we don't need this assumption.) This gives us 

9Trace{v 2 /(#)£} 



dOi 



7Tj(l - 2lTi)(52 k TTfcSfcfc - 1) 



The algorithm in Fig Figure [2] for CTM only applies to a single document. As we discussed 
in the main text, to complete whole algorithm for CTM, we employ the same variational 
EM algorithm as in Blei and Lafferty (2007). The E-step corresponds to the algorithm 
in Figure [2} The M-step, i.e., the estimation of Puk, amounts to, 

N d 

Pkw oc l[x dn = w]q(z dn = k). 

d ro=l 

where d is the index of the documents in the collection, N d is the number of words in 
document d and w is the word index in a predefined vocabulary. 
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Appendix B: Bayesian logistic regression 

In Bayesian logistic regression, let t n is be a p-dimensional observed feature vector for the 
nth sample and z n be its class (represented as an indicator vector of length two). Let 8 be 
the real- valued parameter vector in M p ; there is a component for each feature. The usual 
Bayesian logistic regression model is the following: 

p(9)=M( t i Q ,X ), 
p(z n | 6,t n ) = a(9 T t n ) z ^a(-6 T t n y^, Vn 

where a(y) = 1/(1 + exp(— y)) is the logistic function. 

In Bayesian logistic regression, we can fit the distribution of the observations z\ : n into the 
exponential family with 

h(z) = l, a(z) = 0, t z = t(z) = [zi, . . . ,z N ], 
n(9) T = [log a(0 T t n ), log a(-e T t n )}% =1 , a(n(9)) = 0, 

In this set up, i z represents the whole set of labels. With this notation, the f(8) defined in 
Eq. 12 is 

f(8) = V (0) T i z -l(e- M ) T S \8 - mo). 
The gradient and Hessian for function f(6) in Bayesian logistic regression are 

V/(0) = En =1 ^ [Zn,l ~ cr(9 T t n )) - S Q 1 (9 - U ), 

V 2 /(^) = - En=i a(9 T t n )o(-9 T t n )t n t T n - So K 
Note the first V f(9) is enough for Laplace variational inference. This is the standard Laplace 



approximation to Bayesian logistic regression (Bishop, 2006). 

For delta variational inference, we also need the gradient for Trace {v 2 /(#)£} is 

N 

ddi 



dTracejv 2 /(0)E| A , T , T w , T s T 



n=l 

Note that the "diagonal" assumption for £ is not needed. 
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